Abstract

The MIT-LL/AFRL MT system is a statistical phrase-based translation system that implements many modern SMT training and decoding techniques. Our system was designed with the long-term goal of dealing with corrupted ASR input and limited amounts of training data for speech-to-speech MT applications. This paper discusses the architecture of the MIT-LL/AFRL MT system, improvements over our 2005 system, and experiments with manual and ASR transcription data that were run as part of the IWSLT-2006 evaluation campaign.

1.  Introduction 

In recent years, the development of statistical methods for machine translation has made usable MT a real possibility. Specifically, there have been advances in methods to:

• Extract word alignments from parallel corpora [1] [2]

• Learn and model the translation of phrases [4] [5]

• Combine and optimize model parameters [6] [7] [8]

• Decode and rescore test data [9] [10]

These advances have helped to dramatically increase the quality of MT output. Our 2006 IWSLT system extends these methods and the work we did in 2005 [18].

In subsequent sections, we discuss the details of the translation system, including our alignment and language models and the methods we have implemented for optimization and decoding. Specifically, we highlight improvements and changes made to:

1. Better utilize the larger 2006 training set

2. Add coverage of Italian and Japanese

3. Enhance the coverage of extracted phrases

4. Provide better models and better decoding

5. Increase gains from rescoring n-best lists

This work is sponsored by the Air Force Research Laboratory under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

Figure 1: Basic Statistical Translation Architecture

As this year’s evaluation conditions have changed, our basic translation training and decoding processes have been adapted accordingly, as shown in Figure 1. Boxes in grey have not changed substantially since 2005. Refer to [18] for more detail regarding the implementation of these modules.

We submitted systems for the Chinese-, Japanese- and Italian-to-English language pairs. In each case, we used only the supplied data for each language pair for training and optimization. From these data, we extract word/character alignments. These alignments are then expanded using slightly modified versions of standard heuristics. This process is described in detail in Section 3. Phrases are then extracted and counted, and the resulting phrase table is used for decoding and rescoring. Language models are trained using the English side of each language pair.

Figure 2: A Factor-based Consistency-Checking Model

Using development bitexts held out from the training set, we then employ minimum error rate training to optimize model parameters. The trained parameters and models are then applied to test data during the decoding and rescoring phases of the translation process.

2.  Data  Preprocessing 

For Chinese and Japanese texts, we used the supplied UTF-8 encodings and converted all Roman characters into ASCII. We used Latin-1 encoding for all Italian texts. Source and target side training texts are lower-cased before training.

Because this year’s evaluation data (and devset 4) included no source punctuation, we implemented a source-language repunctuator to better match the training data.
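The character normalization step can be illustrated with a short Python sketch. It assumes that the Roman characters in the Chinese and Japanese texts appear as Unicode full-width forms (U+FF01 to U+FF5E), which sit at a fixed offset from their ASCII counterparts; the function name `normalize_line` is illustrative and is not taken from the system, and the sketch maps the whole full-width block (including punctuation) rather than letters only.

```python
def normalize_line(line: str) -> str:
    """Map full-width Roman characters to ASCII, then lower-case the line."""
    out = []
    for ch in line:
        code = ord(ch)
        if 0xFF01 <= code <= 0xFF5E:
            # Full-width forms of '!'..'~' sit at a fixed offset from ASCII.
            ch = chr(code - 0xFEE0)
        elif code == 0x3000:
            # Ideographic space becomes a plain ASCII space.
            ch = " "
        out.append(ch)
    # Lower-casing leaves CJK characters unchanged.
    return "".join(out).lower()
```

Characters outside the full-width block, such as kanji or kana, pass through untouched, so the same function could be applied to every line of a UTF-8 corpus.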

3.  Improved  Word/Character  Alignments 

In this year’s system we employed multiple word and character alignment strategies, extending the method described in [19]. For all language pairs, we combine alignments from IBM Model 5 (see [1] and [3]) with alignments extracted using the competitive linking algorithm (CLA) described in [20]. We apply a simple likelihood function, though we found only minor differences between this function and others that have been proposed in the literature [21]. Phrases were extracted from both types of alignments and combined in one phrase table.
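The greedy one-to-one linking at the core of competitive linking can be sketched in Python as follows. The association scores are taken as given in a hypothetical `scores` dictionary, since the likelihood function used in the system is not specified here; the function name and the `threshold` parameter are likewise assumptions made for illustration.

```python
def competitive_linking(scores, threshold=0.0):
    """Greedy one-to-one linking of source and target positions.

    scores: dict mapping (source_index, target_index) -> association score.
    Pairs are visited from the highest score down; a pair is linked only if
    neither of its words has been linked already.
    """
    links = []
    used_src, used_tgt = set(), set()
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (i, j), score in ranked:
        if score <= threshold:
            break  # remaining pairs are too weak to link
        if i in used_src or j in used_tgt:
            continue  # one of the two words already has a link
        links.append((i, j))
        used_src.add(i)
        used_tgt.add(j)
    return sorted(links)
```

Because each word is linked at most once, the result is a sparse, high-precision alignment, which is one reason it can complement the denser IBM-model alignments when phrases from both are pooled.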

Additionally, for Chinese-to-English translation, both word and character segmentation were used for training CLA and GIZA alignment models. Phrases were then extracted from all four alignments and combined. Word-segmented phrases were resegmented into characters before counting.
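A minimal sketch of the resegmentation step, assuming that phrase pairs are available as (source, target) strings with whitespace-separated source tokens. The helper names are hypothetical, and the sketch splits every source token into characters, including any Roman-character tokens.

```python
from collections import Counter


def to_characters(phrase: str) -> str:
    """Resegment a whitespace-tokenized phrase into single characters."""
    return " ".join(ch for token in phrase.split() for ch in token)


def aggregate_counts(phrase_pairs):
    """Count phrase pairs after normalizing the source side to characters."""
    counts = Counter()
    for source, target in phrase_pairs:
        counts[(to_characters(source), target)] += 1
    return counts
```

With this normalization, a phrase extracted from a word-segmented alignment and the same phrase extracted from a character-segmented alignment fall into one entry, so their counts add up in the combined phrase table.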

4.  Improved  Translation  Models 

Following the 2006 JHU summer workshop, we conducted a number of experiments with factored translation models using our training/decoding paradigm. To this end, we integrated the moses decoder into our minimum error rate training and decoding processes. This allowed us to try two different factor-based approaches to the IWSLT Chinese-English translation task.

Factored translation models extend standard phrase-based statistical models by representing words as vectors of factors. This representation can be used to decompose words into constituent parts (e.g. lemma + affix) for the purpose of modeling them separately, or to generalize words into larger linguistic “classes” (e.g. part-of-speech). From a factored representation, it is possible to train standard statistical models that are then combined using standard log-linear assumptions, in which feature functions of the form $h_{FACTOR_k}(e_{1 \ldots i}, f_{1 \ldots j})$ represent translation likelihoods that are specific to factor $k$, and special generation features $h_{gen}(FACTOR_k(e_i), FACTOR_l(e_i))$ represent the likelihood of generating $FACTOR_k$ from $FACTOR_l$.

Figure 3: A Parallel Word Class/Surface Translation Model

Because we did not have access to analysis tools for Chinese during the IWSLT evaluation, we chose to create models using automatically derived word classes (as generated by mkcls). In our experiments, words are represented both by their surface form and by their associated word classes.

Using this representation we trained two different models:

• Consistency-Checking Model - Translate source surface forms to target, generate word classes for each target word, then apply a class-based LM.

• Parallel Translation Model - Translate both source surface forms and word classes to target word/class pairs, then apply a class-based LM.

These models are shown schematically in Figure 2 and Figure 3, respectively. We note that the parallel approach is quite similar to the alignment template model proposed in [22] with an additional surface-to-surface form translation model. These models were not applied in time for official submission to the 2006 evaluation, but in post-evaluation experiments we found these models to be quite helpful.
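The two-factor word representation (surface form plus word class) can be illustrated with a small Python sketch. It assumes a class file with one whitespace-separated `word class` pair per line and writes each token as `surface|class`; both the file layout and the helper names are assumptions made for illustration, not a description of the tools used in the system.

```python
def load_classes(lines):
    """Read 'word class' pairs, one per line, into a dictionary."""
    classes = {}
    for line in lines:
        word, cls = line.split()
        classes[word] = cls
    return classes


def annotate(sentence, classes, unknown="UNK"):
    """Attach a word-class factor to every token as 'surface|class'."""
    return " ".join(f"{w}|{classes.get(w, unknown)}" for w in sentence.split())


def class_sequence(sentence, classes, unknown="UNK"):
    """Return only the class factor of each token, as a class-based LM would see it."""
    return [classes.get(w, unknown) for w in sentence.split()]
```

In the Consistency-Checking setup the class sequence is generated from the translated surface forms and scored by the class-based LM, whereas in the parallel setup both factors are translated; the sketch only shows the shared data representation.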

5.  Improved  Decoding 

For the 2006 evaluation we used a combination of two decoders: our in-house decoder mtdecoder and the moses decoder developed as part of the 2006 JHU summer workshop. For both decoders we found it advantageous to use 4-gram and 5-gram language models in decoding. Our official submissions for Chinese, Japanese and Italian use 4-gram interpolated Kneser-Ney models trained using the SRI Language Modeling Toolkit [12] [13] [14].

Using our decoder we implemented three types of reordering constraints, revisiting work done in [24] with the IWSLT-2006 data. We explored both ITG [25] and IBM constraints, and the results shown in Section 6 indicate that different reordering constraints do not decrease the BLEU score significantly in most language pairs while reducing decoding time by 20-50%. Both constraints disallow certain reordering configurations. Figures 4 and 5 offer examples of these configurations. Details of these experiments are described in [23].

Figure 4: An example of a disallowed reordering using IBM constraints

Figure 5: An example of a disallowed reordering using ITG constraints

6. Results

As in 2005, we employ minimum error rate training to optimize model scaling factors for both decoding and rescoring features. In this year’s evaluation, we added 5-gram rescoring language models and 6-gram class-based rescoring language models after decoding. After the evaluation we added sentence length posterior features for rescoring. A full list of the feature functions used in our system is shown in Table 1.

We approximate sentence length posteriors from the n-best list:

$$P(L \mid f_{1 \ldots J}) \approx \sum_{\{e \,:\, |e| = L\}} P(e_{1 \ldots L} \mid f_{1 \ldots J}) \qquad (1)$$

Similarly, IBM model 1 scores can be computed for each n-best list entry:

$$P_{ibm1} \approx \frac{1}{I^J} \prod_{j=1}^{J} \sum_{i=1}^{I} p(f_j \mid e_i) \qquad (2)$$

6.1. Development Experiments

In preparation for the arrival of the official evaluation data, we conducted experiments with our system using dev4 in each of the language pairs. For these experiments we set aside dev1 for minimum error rate training.


Decoding Features
  P(f|e)
  P(e|f)
  LexW(f|e)
  LexW(e|f)
  Phrase Penalty
  Lexical Backoff
  Word Penalty
  Distortion
  P(e) - 4-gram language model
Rescoring Features
  P(e) - 5-gram LM
  P(e) - 6-gram class-based LM
  P_Model1(f|e) - IBM model 1 translation probabilities
  Sentence-length posterior*

Table 1: Feature functions used in the translation model
* features added after the official submission

6.1.1. Segmentation and Alignment

For different language pairs we employ different segmentation techniques. We use basic word segmentation for Italian, combining phrases extracted from IBM model 5 alignments with CLA alignments. For Japanese, we found it optimal to use word segmentation with character segmentation backoff (using character-extracted phrases for OOV words) together with CLA alignments. In the Chinese case, we use both word and character segmentation. From both, we compute both CLA and IBM model 5 alignments and extract phrases that are then normalized to character segmentation when aggregating counts.

Tables 2, 3 and 4 show a summary of results for various configurations of segmentation and alignment.

Configuration                  BLEU
Character Segmented            21.24
Word Segmented                 21.01
Char+Word Segmented            21.21
Char+Word Segmented + CLA      22.18

Table 2: Segmentation/alignment results for Chinese (dev4)

Configuration                                 BLEU
Word Segmented                                23.63
Word Segmented + Character Backoff            23.82
Word Segmented + CLA                          23.34
Word Segmented + Character Backoff + CLA      24.28

Table 3: Segmentation/alignment results for Japanese (dev4)
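One possible reading of the character backoff used for Japanese is sketched below in Python: words covered by the word-level phrase table are kept whole, while out-of-vocabulary words are split into characters so that character-extracted phrases can apply. The function name and the `known_words` set are hypothetical, and the actual backoff mechanism in the system may differ.

```python
def character_backoff(tokens, known_words):
    """Keep known words intact and split OOV words into characters."""
    output = []
    for token in tokens:
        if token in known_words:
            output.append(token)
        else:
            # OOV word: fall back to its individual characters.
            output.extend(token)
    return output
```

For example, with `known_words = {"は", "どこ"}` the input `["ホテル", "は", "どこ"]` would be passed to the decoder as `["ホ", "テ", "ル", "は", "どこ"]`.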


Configuration              BLEU
Word Segmented             35.13
Word Segmented + CLA       37.40

Table 4: Segmentation/alignment results for Italian (dev4)

6.1.2. Rescoring

In addition to the standard features that we use during decoding, we introduce a number of additional features for rescoring n-best lists generated by our decoder (or moses). For the 2006 evaluation we tried a number of new features, including longer context LMs (text and class-based), IBM model 1, unigram posteriors and sentence length posteriors. Empirically, we found that all features with the exception of unigram posteriors were beneficial. As shown in Table 5, rescoring is helpful when testing on dev4 for all language pairs, though the gain varies widely (from 3.32% to 10.76% relative improvement).


                                        BLEU
Configuration                  Chinese  Japanese  Italian
Baseline 4-gram Decode         21.39    21.92     36.92
+ 5-gram rescore LM            21.55    -         -
+ 6-gram class-based LM        21.52    -         -
+ Model 1                      21.86    -         -
+ Sentence Length Posterior    22.10    -         -
+ ALL                          22.10    24.28     37.40

Table 5: Rescoring results for all languages (dev4)
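Two of the rescoring features, the sentence length posterior and the IBM model 1 score, can be sketched in Python under stated assumptions. The n-best list is assumed to carry one log score per hypothesis, which is normalized over the list to obtain posteriors; the model 1 score is assumed to be normalized by 1/I^J with no NULL word; and `t_table`, `floor` and the function names are hypothetical.

```python
import math
from collections import defaultdict


def length_posteriors(nbest):
    """Approximate P(L | f) by summing normalized n-best posteriors per length.

    nbest: list of (tokens, log_score) pairs for one source sentence.
    """
    top = max(score for _, score in nbest)
    weights = [math.exp(score - top) for _, score in nbest]  # stable softmax
    total = sum(weights)
    posteriors = defaultdict(float)
    for (tokens, _), weight in zip(nbest, weights):
        posteriors[len(tokens)] += weight / total
    return dict(posteriors)


def model1_log_score(source, target, t_table, floor=1e-10):
    """Log of (1 / I^J) * prod_j sum_i t(f_j | e_i) for one hypothesis."""
    log_score = -len(source) * math.log(len(target))
    for f in source:
        mass = sum(t_table.get((f, e), 0.0) for e in target)
        log_score += math.log(max(mass, floor))  # floor avoids log(0)
    return log_score
```

Each hypothesis would then receive the posterior of its own length and its model 1 log score as two additional entries in the log-linear feature vector used for rescoring.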


6.1.3.  Pre/Post-Processing 

During the evaluation, we explored different pre- and post-processing options to optimize for this year’s official evaluation criterion (mixed-case, with punctuation, no source punctuation provided). We tried two different methods of producing target punctuation: 1) training asymmetric models by removing source punctuation from the training and development corpora, and 2) repunctuating the source sentences in the supplied development and test corpora.


                                          BLEU
Configuration                    Chinese  Japanese  Italian
Remove Source Punctuation
  w/4-gram TrueCase LM           21.86    23.14     36.64
Repunctuate Source
  w/3-gram TrueCase LM           21.93    -         -
  w/4-gram TrueCase LM           22.10    24.28     37.40
  w/5-gram TrueCase LM           22.10    -         -

Table 6: Effects of different pre/post-processing methods (dev4)

To produce mixed-case output, we implemented an HMM-based truecasing model as proposed in [26]:

$$w^*_{1 \ldots K} = \arg\max_{w_{1 \ldots K}} P(w_{1 \ldots K} \mid s_{1 \ldots K}) = \arg\max_{w_{1 \ldots K}} P(w_{1 \ldots K}) \, P(s_{1 \ldots K} \mid w_{1 \ldots K})$$

where a standard, interpolated language model approximation is used as in:

$$P(w_{1 \ldots K}) \approx \prod_{k=1}^{K} P(w_k \mid w_{k-1} \ldots) \qquad (3)$$

and an approximate table of conditional emission probabilities is represented by:

$$P(s_{1 \ldots K} \mid w_{1 \ldots K}) \approx \prod_{k=1}^{K} P(s_k \mid w_k) \qquad (4)$$

As shown in Table 6, automatic repunctuation of the input source is beneficial in performance terms. Similarly, small gains can be had by choosing the appropriate language model order for TrueCasing.

Figure 6: Performance of two factored models with different class-LM contexts
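The search behind an HMM-based truecaser can be sketched as a Viterbi pass over the cased variants of each lower-cased token. The sketch below is a simplification: it uses a bigram language model with a fixed floor for unseen bigrams in place of an interpolated higher-order model, and the `case_table` and `bigram_logp` structures are hypothetical stand-ins for the trained models.

```python
import math


def truecase(tokens, case_table, bigram_logp, floor=-10.0):
    """Viterbi search over cased variants of lower-cased tokens.

    case_table:  {lowercase word: {cased variant: P(lowercase | variant)}}
    bigram_logp: {(previous cased word, cased word): log probability}
    Unseen bigrams receive the `floor` log probability.
    """
    # best[v] = (log score of the best path ending in variant v, that path)
    best = {"<s>": (0.0, [])}
    for token in tokens:
        # Tokens without an entry keep their lower-cased form.
        variants = case_table.get(token, {token: 1.0})
        new_best = {}
        for variant, emission in variants.items():
            candidates = [
                (score + bigram_logp.get((prev, variant), floor) + math.log(emission),
                 path + [variant])
                for prev, (score, path) in best.items()
            ]
            new_best[variant] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]
```

Raising the language model order changes only how the transition score is computed, which is consistent with the order of the truecasing LM being a tunable choice.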


6.1.4.  Factored  Models 

After the official evaluation deadline, we ran a number of experiments to explore the performance of the factored models described in Section 4. Our experiments focus on a baseline Chinese-to-English system trained using only word segmentation and optimized as described above. Due to time constraints, we did not perform the rescoring described in Section 6.1.2. With this configuration, our baseline system achieves a BLEU score of 19.60 on dev4 with the official evaluation criteria.

As  shown  in  Figure  6,  both  factored  approaches  achieve 
substantial  gains,  though  the  Consistency  Checking  model  is 
consistently  better  than  both  the  baseline  and  the  Parallel 
Translation  model.  This  approach  equals  the  performance 
of  our  best  rescoring  model  on  dev4  despite  starting  from  a 
worse  baseline. 

We have seen that limitations in the current implementation of moses may cause search errors in our parallel translation models. Despite these limitations, our parallel models offer some advantage.


6.1.5.  Decoder  Reordering  Constraints 


                        BLEU/Time (secs)
Configuration  Chinese        Japanese       Italian
free           20.32/3509.5   22.35/3309.7   35.85/90.6
IBM            19.85/2961.0   21.46/2969.3   35.52/36.2
ITG            19.85/2961.0   21.37/1868.7   35.52/36.2

Table 7: Performance of different reordering constraints (dev4)

Although we did not use ITG or IBM reordering constraints in our official submissions, some development experiments with these constraints did yield gains. Unfortunately, these gains were not consistent across dev sets. Table 7 shows the performance of different reordering constraints in contrast to the default of free reordering.

Gains in processing time are quite apparent: a 20-60% improvement in speed can be had with minimal BLEU score impact using these reordering constraints. More detailed experiments with these constraints can be found in [23].
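The two constraint families can be illustrated at the word level with a short Python sketch. It checks a complete reordering rather than pruning during search, and the window size `k=4` and the function names are assumptions for illustration. The IBM check requires each newly covered source position to be among the first k uncovered positions; the ITG check uses the known characterization that ITG-compatible permutations are those that contain neither of the "inside-out" patterns 2-4-1-3 and 3-1-4-2.

```python
from itertools import combinations


def satisfies_ibm(order, k=4):
    """order: source positions in the order they are translated."""
    uncovered = sorted(order)
    for position in order:
        # The next position must be one of the first k uncovered positions.
        if uncovered.index(position) >= k:
            return False
        uncovered.remove(position)
    return True


def satisfies_itg(order):
    """True if the permutation avoids the patterns 2413 and 3142."""
    forbidden = {(1, 3, 0, 2), (2, 0, 3, 1)}
    for quad in combinations(order, 4):  # subsequences keep their order
        ranks = tuple(sorted(quad).index(value) for value in quad)
        if ranks in forbidden:
            return False
    return True
```

The brute-force ITG check is quartic in sentence length and is meant only to make the constraint concrete; a decoder would enforce either constraint incrementally while extending hypotheses, which is where the reduction in decoding time comes from.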

7.  Evaluation  Results  and  Analysis 


Text Input
                                          BLEU
Configuration                    Chinese  Japanese  Italian
Optimized (dev4)                 21.57    20.99     35.74
Optimized (dev1)                 20.66    20.24     34.40
Optimized (dev4) - No Rescoring  21.27    -         -

Table 8: Overall performance of submitted systems with text input (test-2006)


ASR Input
                                      BLEU
Configuration         Ch (Read)  Ch (Spon.)  Japanese  Italian
1-best, Opt (dev4)    18.61      16.57       18.91     27.98
10-best, Opt (dev4)   17.42      16.57       -         28.81
1-best, Opt (dev1)    18.46      -           18.43     27.64

Table 9: Overall performance of submitted systems with ASR input (test-2006)

Tables 8 and 9 show our official submissions to the 2006 IWSLT evaluation. Official primary submissions are shown in bold. Each primary system performed well, ranking 3rd/4th in ASR BLEU scores and 2nd/4th in text BLEU scores among submitted systems. Note that our primary system was not always best (e.g. Italian ASR condition). Our primary submissions were optimized using dev4. These submissions processed 1-best ASR input and reference transcriptions. Our secondary submissions decoded the 10-best from the ASR lattice, merging MT n-best lists and rescoring with ASR features as described in [18].


Rerunning our system using the 2005 train/dev/test paradigm, we found that our system gained over 4 BLEU points (8.7% relative improvement) with respect to our previous best.

Our next steps include further development of our in-house decoder and experiments with factored models using better baselines and better search methods.